Lesson 6. Spatial Queries

Spatial analysis is a process that begins with exploring a dataset and mapping data variables and ends with potentially complex models and visualizations of real world features and phenomena. Spatial queries are the building blocks of this process. These queries are software operations that allow us to ask questions of our data and which return data metrics, subsets or new data objects. In this lesson we explore the two basic types of spatial queries: measurement queries and relationship queries.


Instructor Notes


Types of Spatial Queries

The basic types of spatial queries are:

  • Measurement queries
    • What is feature A’s length?
    • What is feature A’s perimeter?
    • What is feature A’s area?
    • What is feature A’s distance from feature B?
    • etc.
  • Relationship queries
    • Does feature A intersect with feature B?
    • Is feature A within feature B?
    • Does feature A cross feature B?
    • etc.

Both of these types of queries operate on the geometry of features and are dependent on the type of geometry. For example, with point features you to explore distance measurement queries and ask what points are inside or outside of specific polygon objects. Polygon features, on the other hand, allow for a wider range of both measurement and spatial relationship queries.

An important distinction between these two types of queries is that measurement queries depend on the CRS of the data while spatial relationship queries do not. This is because topological relationships, the term used to describe spatial relationships, are invariant to rotation, translation and scaling transformations like those from that CRS transformations.

We’ll work through examples of each of those types of queries.

Then we’ll see an example of a very common spatial analysis that is a conceptual amalgam of those two types: proximity analysis.

6.0 Load and prep some data

Load the libraries we will use.

library(sf)
library(tmap)

Let’s read in the CA census tracts data and then take a look at its geometry and attributes.

census_tracts = st_read("notebook_data/census/Tracts/cb_2013_06_tract_500k.shp")
## Reading layer `cb_2013_06_tract_500k' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/census/Tracts/cb_2013_06_tract_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 8043 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 42.00952
## Geodetic CRS:  NAD83
plot(census_tracts$geometry)

head(census_tracts)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.3049 ymin: 37.76456 xmax: -122.2074 ymax: 37.84851
## Geodetic CRS:  NAD83
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID NAME LSAD   ALAND
## 1      06      001  400300 1400000US06001400300 06001400300 4003   CT 1105329
## 2      06      001  400900 1400000US06001400900 06001400900 4009   CT  420877
## 3      06      001  402200 1400000US06001402200 06001402200 4022   CT  712082
## 4      06      001  402800 1400000US06001402800 06001402800 4028   CT  398311
## 5      06      001  404800 1400000US06001404800 06001404800 4048   CT  628405
## 6      06      001  406100 1400000US06001406100 06001406100 4061   CT 1843685
##   AWATER                       geometry
## 1      0 MULTIPOLYGON (((-122.2642 3...
## 2      0 MULTIPOLYGON (((-122.2856 3...
## 3      0 MULTIPOLYGON (((-122.304 37...
## 4      0 MULTIPOLYGON (((-122.276 37...
## 5      0 MULTIPOLYGON (((-122.2182 3...
## 6  74875 MULTIPOLYGON (((-122.2387 3...

Then we’ll select just the Alameda County census tracts.

census_tracts_ac = census_tracts[census_tracts$COUNTYFP=='001',]
plot(census_tracts_ac)

6.1 Measurement Queries

We’ll start off with some simple measurement queries.

We can get the areas of each of our census tracts using the st_area function.

st_area(census_tracts_ac)[1:10]
## Units: [m^2]
##  [1] 1106564.4  435823.9  693541.5  400641.5  618815.4 1962114.5  370336.5
##  [8]  478238.0  587716.8  857795.3

Okay!

We got…

numbers!

…?

Question

  1. What do those numbers mean?
  2. What are the units?
  3. And if we’re not sure, how might be find out?

Let’s take a look at our CRS.

st_crs(census_tracts_ac)
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Wow! We’re working in an unprojected (aka geographic) CRS, with units of decimal degrees, but sf automatically gave us area measurements in square meters (rather than the rather irrational square degrees).

How did it do this? For unprojected CRS, sf calculates geodetic measurements (i.e. travel-distances across the earth’s curved surface). It uses the st_geod_area function for this; see the ?st_area documentation for details.


That said, when doing spatial analysis, we will almost always want to work in a projected CRS since many spatial operations do not work with geographic CRSs.

Time to project! We’ll transform the data to the UTM Zone 10N, NAD83 CRS (EPSG 26910) which is appropriate for Northern California.

census_tracts_ac_utm10 = st_transform(census_tracts_ac, 26910)

Now check it..especially look for the units.

st_crs(census_tracts_ac_utm10)
## Coordinate Reference System:
##   User input: EPSG:26910 
##   wkt:
## PROJCRS["NAD83 / UTM zone 10N",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4269]],
##     CONVERSION["UTM zone 10N",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-123,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["North America - 126°W to 120°W and NAD83 by country"],
##         BBOX[30.54,-126,81.8,-119.99]],
##     ID["EPSG",26910]]

Now let’s try our area calculation again.

st_area(census_tracts_ac_utm10)[1:10]
## Units: [m^2]
##  [1] 1105796.6  435518.4  693052.3  400361.5  618393.6 1960768.5  370083.9
##  [8]  477916.4  587326.6  857191.6

What if we compare areas calculated from our unprojected and projected CRS?

print(st_area(census_tracts_ac)[[1]])
## 1106564 [m^2]
print(st_area(census_tracts_ac_utm10)[[1]])
## 1105797 [m^2]

Hmmm… The numbers are a bit different…specifically…

st_area(census_tracts_ac)[[1]] - st_area(census_tracts_ac_utm10)[[1]]
## 767.8165 [m^2]

You may have noticed that our census tracts already have an area column in them.

Let’s also compare the calculated areas with the data in this column.

print(st_area(census_tracts_ac)[[1]])
## 1106564 [m^2]
print(st_area(census_tracts_ac_utm10)[[1]])
## 1105797 [m^2]
print(census_tracts$ALAND[1])
## [1] 1105329

Question

What explains the discrepancy? Which areas are correct? Which are incorrect?

Doing more

We can also sum the area for Alameda county by wrapping our area calculation in a call to sum.

sum(st_area(census_tracts_ac_utm10))
## 1948917581 [m^2]

We can look up how large Alameda County is to check our work.The county is 739 miles2, which is around 1,914,001,213 meters2. I’d say we’re pretty close!

# Sum the area of all Alameda county Census Tracts dynamically in square miles...
sum(units::set_units(st_area(census_tracts_ac_utm10),mi^2))
## 752.4813 [mi^2]
# Add the area of all Alameda County Census tracts to the data frame
census_tracts_ac_utm10$area_sqmi <-units::set_units(st_area(census_tracts_ac_utm10), mi^2)

# Check it by summing
print(sum(census_tracts_ac_utm10$area_sqmi))
## 752.4813 [mi^2]
# Take a look
head(census_tracts_ac_utm10,3)
## Simple feature collection with 3 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 561190.1 ymin: 4184071 xmax: 566536.6 ymax: 4189277
## Projected CRS: NAD83 / UTM zone 10N
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID NAME LSAD   ALAND
## 1      06      001  400300 1400000US06001400300 06001400300 4003   CT 1105329
## 2      06      001  400900 1400000US06001400900 06001400900 4009   CT  420877
## 3      06      001  402200 1400000US06001402200 06001402200 4022   CT  712082
##   AWATER                       geometry        area_sqmi
## 1      0 MULTIPOLYGON (((564745 4188... 0.4269505 [mi^2]
## 2      0 MULTIPOLYGON (((562861.1 41... 0.1681546 [mi^2]
## 3      0 MULTIPOLYGON (((561264.5 41... 0.2675890 [mi^2]

Calculating the area of features and adding it to the dataframe is a useful operation because it allows us to convert count variables like population to densities.


Also, you may have been wondering how R is managing to tell us the units of our measurements.

It turns out that sf depends on the units package to track units.

This is super convenient! But there is a gotcha:

# convert to square kilometers
sum(st_area(census_tracts_ac_utm10)) / (1000^2)
## 1948.918 [m^2]

Oops! Our manual conversion to square kilometers gave us the right number but kept the now-wrong units!

Here’s the proper way to convert:

units::set_units(sum(st_area(census_tracts_ac_utm10)), km^2)
## 1948.918 [km^2]

Much nicer! In case you’re wondering how we knew the right abbreviation to use for kilometers, check out the leftmost column in this reference table:

# View(units::valid_udunits())

Calculating Length or Permeter

As it turns out, we can similarly use the st_length operator to calculate the length or perimeter of a feature. Always take note of the output units!

st_length(census_tracts)[1:10]
## Units: [m]
##  [1] 5358.919 2757.904 5397.799 2682.912 3711.654 6409.748 2540.302 3090.729
##  [9] 4262.487 3936.085

6.2 Spatial Relationship Queries

Spatial relationship queries consider how two geometries or sets of geometries relate to one another in space. For example, you may want to know what schools are located within the City of Berkeley or what East Bay Regional Parks have land within Berkeley. You may also want to combine a measurement query with a spatial relationship query. Example, you may want to know the total length of freeways within the city of Berkeley.

Here is a list of some of the more commonly used sf spatial relationship operations.

  • st_intersects
  • st_within
  • st_contains
  • st_disjoint

These can be used to select features in one dataset based on their spatial relationship to another. In other works, you can use these operations to make spatial selections / create spatial subsets.

Enough talk. Let’s work through some examples.

What Alameda County Schools are in Berkeley?

First, load the CA Places data and select the city of Berkeley and save it to a sf dataframe.

places = st_read('notebook_data/census/Places/cb_2018_06_place_500k.shp')
## Reading layer `cb_2018_06_place_500k' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/census/Places/cb_2018_06_place_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1521 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.2694 ymin: 32.53416 xmax: -114.229 ymax: 41.99314
## Geodetic CRS:  NAD83
berkeley = places[places$NAME=='Berkeley',]
plot(berkeley$geometry)

Then, load the Alameda County schools data and make it a spatial dataframe.

schools_df = read.csv('notebook_data/alco_schools.csv')
schools_sf = st_as_sf(schools_df, 
                      coords = c('X','Y'),
                      crs = 4326)

Check that the two sf dataframes have the same CRS.

st_crs(schools_sf) == st_crs(berkeley)
## [1] FALSE

They don’t have the same CRS so we need to align them. Let’s transform (or reproject) the CRS of both of these dataframes to UTM10N, NAD83 (EPSG 26910). This is a commonly used CRS for Northern CA data.

# Transform data CRSs...
schools_utm10 <- st_transform(schools_sf, 26910)
berkeley_utm10 <- st_transform(berkeley, 26910)

If you look at the Schools data you will see that it has a City column. So we can subset the data by attribute to select the Schools in Berkeley. No need to do a spatial selection.

berkeley_schools = schools_utm10[schools_utm10$City=='Berkeley',]
dim(berkeley_schools)
## [1] 31  8
plot(berkeley_utm10$geometry)
plot(berkeley_schools$geometry, add=T)

That looks good and was relatively simple operation. But what if the schools data didn’t have that city column? How could we identify the schools in Berkeley spatially?

# Spatially select only the schools within Berkeley
berkeley_schools_spatial = schools_utm10[berkeley_utm10,]

# How many did we get
dim(berkeley_schools_spatial)
## [1] 32  8
# Map the results
plot(berkeley_utm10$geometry)
plot(berkeley_schools_spatial$geometry, add=T)

What just happened? Look closely at the spatial subset operation above.

Interestingly, we have one more school in Berkeley based on the spatial selection!? Let’s take a look.

plot(berkeley_utm10$geometry)
plot(berkeley_schools_spatial$geometry, add=T)
plot(berkeley_schools$geometry,col="red", add=T)

Let’s use an interactive tmap to zoom into the school that was not selected by attribute but was spatially selected.

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(berkeley_utm10) +
  tm_borders() +
tm_shape(berkeley_schools_spatial) +
 tm_dots(col="black", size=.3) +
 tm_shape(berkeley_schools) +
 tm_dots(col="red", size=.1)

In a similar fashion, we can use the st_disjoint operator to select only those schools NOT spatial within Berkeley.

NOTE: The default spatial select operator is st_intersects - any other operator must be specified.

# Select all Alameda County Schools NOT in Berkeley with the disjoint operator
berkeley_schools_disjoint = schools_utm10[berkeley_utm10, , op=st_disjoint]
plot(berkeley_schools_disjoint$geometry)
plot(berkeley_utm10, col=NA, border="red", add=T)
## Warning in plot.sf(berkeley_utm10, col = NA, border = "red", add = T): ignoring
## all but the first attribute

There is no need to memorize these spatial operators (aka predicates)! Here is a fantastic sf cheatsheet that lists and briefly explains all these common functions (and many more).


Protected Areas in Alameda County

Let’s load a new dataset, the CA Protected Areas Database (CPAD), to demonstrate these spatial queries in more detail.

This dataset contains all of the protected areas (parks and the like) in California.

We will use this data and the Alameda County Census Data that we created earlier to ask “What protected areas are within Alameda County?”

First load the CPAD data (CA Protected Areas Database).

pas = st_read('./notebook_data/protected_areas/CPAD_2020a_Units.shp')
## Reading layer `CPAD_2020a_Units' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/protected_areas/CPAD_2020a_Units.shp' using driver `ESRI Shapefile'
## Simple feature collection with 17068 features and 21 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -374984.2 ymin: -604454.8 xmax: 540016.3 ymax: 449743.2
## Projected CRS: NAD83 / California Albers

What is the CRS of the PAS data?

Let’s transform the data to match census_tracts_ac_utm10.

pas_utm10 = st_transform(pas, st_crs(census_tracts_ac_utm10))

Let’s plot the data in so that we know what to expect. CPAD is big so wait for it…

plot(census_tracts_ac_utm10$geometry, col='red')
plot(pas_utm10$geometry, col='green', add=T)

We can see from our map that some of the protected areas are completely within Alameda County and some of them overlap. To get both of these cases we use the st_intersects operator, which is the default. Let’s check it out.

pas_intersects = pas_utm10[census_tracts_ac_utm10,]  #st_intersects
pas_within = pas_utm10[census_tracts_ac_utm10, , op=st_within] #st_within

We can use tmap to explore the difference in the results from st_intersects vs st_within

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(census_tracts_ac_utm10)+
  tm_polygons(col="gray", border.col="grey") +
tm_shape(pas_intersects) +
  tm_borders(col="green") +
tm_shape(pas_within) +
  tm_borders(col='red')

What you can see from the above, is that by default, st_intersects returns the features that intersect but it does not clip the features to the boundary of Alameda County. For that, we would need to use a different spatial operation - st_intersection.

Let’s try it!

pas_in_ac = st_intersection(pas_utm10, census_tracts_ac_utm10)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries

Great! Now, if we scroll the resulting sf object we’ll see that the COUNTY column of our resulting subset gives us a good sanity check on our results.

head(pas_in_ac)
## Simple feature collection with 6 features and 31 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 561576 ymin: 4184117 xmax: 565367.9 ymax: 4188588
## Projected CRS: NAD83 / UTM zone 10N
##        ACCESS_TYP UNIT_ID                     UNIT_NAME SUID_NMA AGNCY_ID
## 8478  Open Access   13443                Hardy Dog Park    19677     1228
## 11167 Open Access   47996                  Redondo Park    32418     1228
## 4954  Open Access   14781           Temescal Creek Park    26230     1098
## 3585  Open Access   29165 Cypress Freeway Memorial Park    17873     1228
## 8131  Open Access   13451           South Prescott Park    25676     1228
## 9319  Open Access   15535               Mandela Parkway    21738     1228
##                AGNCY_NAME AGNCY_LEV   AGNCY_TYP
## 8478     Oakland, City of      City City Agency
## 11167    Oakland, City of      City City Agency
## 4954  Emeryville, City of      City City Agency
## 3585     Oakland, City of      City City Agency
## 8131     Oakland, City of      City City Agency
## 9319     Oakland, City of      City City Agency
##                                                   AGNCY_WEB LAYER MNG_AG_ID
## 8478  http://www2.oaklandnet.com/Government/o/opr/index.htm  City      1228
## 11167 http://www2.oaklandnet.com/Government/o/opr/index.htm  City      1228
## 4954          http://www.ci.emeryville.ca.us/158/City-Parks  City      1098
## 3585  http://www2.oaklandnet.com/Government/o/opr/index.htm  City      1228
## 8131  http://www2.oaklandnet.com/Government/o/opr/index.htm  City      1228
## 9319  http://www2.oaklandnet.com/Government/o/opr/index.htm  City      1228
##                MNG_AGENCY MNG_AG_LEV  MNG_AG_TYP PARK_URL  COUNTY  ACRES
## 8478     Oakland, City of       City City Agency     <NA> Alameda  0.788
## 11167    Oakland, City of       City City Agency     <NA> Alameda  0.250
## 4954  Emeryville, City of       City City Agency     <NA> Alameda  1.527
## 3585     Oakland, City of       City City Agency     <NA> Alameda  1.278
## 8131     Oakland, City of       City City Agency     <NA> Alameda  4.269
## 9319     Oakland, City of       City City Agency     <NA> Alameda 12.470
##                          LABEL_NAME YR_EST                DES_TP GAP_STS
## 8478                 Hardy Dog Park      0            Local Park       4
## 11167                  Redondo Park      0            Local Park       4
## 4954            Temescal Creek Park   1972            Local Park       4
## 3585  Cypress Freeway Memorial Park      0            Local Park       4
## 8131            South Prescott Park      0            Local Park       4
## 9319                Mandela Parkway      0 Local Recreation Area       4
##       STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID NAME LSAD
## 8478       06      001  400300 1400000US06001400300 06001400300 4003   CT
## 11167      06      001  400300 1400000US06001400300 06001400300 4003   CT
## 4954       06      001  400900 1400000US06001400900 06001400900 4009   CT
## 3585       06      001  402200 1400000US06001402200 06001402200 4022   CT
## 8131       06      001  402200 1400000US06001402200 06001402200 4022   CT
## 9319       06      001  402200 1400000US06001402200 06001402200 4022   CT
##         ALAND AWATER        area_sqmi                       geometry
## 8478  1105329      0 0.4269505 [mi^2] POLYGON ((565274.7 4188544,...
## 11167 1105329      0 0.4269505 [mi^2] POLYGON ((565005.3 4188113,...
## 4954   420877      0 0.1681546 [mi^2] POLYGON ((563475.6 4188002,...
## 3585   712082      0 0.2675890 [mi^2] POLYGON ((562225.1 4184994,...
## 8131   712082      0 0.2675890 [mi^2] POLYGON ((561576 4184223, 5...
## 9319   712082      0 0.2675890 [mi^2] POLYGON ((562327.9 4184999,...

An overlay plot can also provide a nice check!

tm_shape(census_tracts_ac_utm10) + 
  tm_polygons(col='gray', border.col='gray') +
tm_shape(pas_in_ac) + 
  tm_polygons(col = 'ACRES', palette = 'YlGn',
              border.col = 'black', lwd = 0.4, 
              alpha = 0.8,
              title =  'Protected areas in Alameda County, colored by area')

st_intersects or st_intersection?

It really depends! But make sure you understand the difference. The st_intersects operator is fundamentally a logical, subsetting query that returns those features in the target dataframe that intersect the source geometry, in this case Alameda County census tracts. On the other hand, st_intersection returns a new spatial dataframe that includes geometries from one dataframe that have been clipped to the boundary of the other dataframe. It also includes the attributes from both dataframes.

Exercise: Spatial Relationship Query

It’s your turn.

Write a spatial relationship query to create a new dataset containing only the BART stations in Berkeley.

Run the next two cells to load datasets containing Berkeley’s city boundary and Alameda County’s schools and to reproject them to match the Berkeley_utm10 dataframe (EPSG: 26910).

Then in the following cell, write your own code to: 1. Spatially select the BART stations that are within Berkeley 2. Plot the Berkeley boundary and then overlay the selected BART stations.

To see the solution, look at the hidden text below.

# load the Berkeley boundary
bart_df = st_read("notebook_data/transportation/bart.csv")
## Reading layer `bart' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/transportation/bart.csv' using driver `CSV'
## Warning: no simple feature geometries present: returning a data.frame or tbl_df
bart_sf = st_as_sf(bart_df, 
                   coords = c('lon','lat'),
                   crs = 4326)
  
# transform to EPSG:26910
bart_utm10 = st_transform(bart_sf, st_crs(berkeley_utm10))

# display
head(berkeley_utm10)
## Simple feature collection with 1 feature and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 559347.6 ymin: 4188961 xmax: 567371.4 ymax: 4195617
## Projected CRS: NAD83 / UTM zone 10N
##      STATEFP PLACEFP  PLACENS         AFFGEOID   GEOID     NAME LSAD    ALAND
## 1213      06   06000 02409837 1600000US0606000 0606000 Berkeley   25 27127391
##        AWATER                       geometry
## 1213 18715614 MULTIPOLYGON (((559347.6 41...

Plot the data together

plot(bart_utm10$geometry)
plot(berkeley_utm10$geometry, border='blue', add=T)

Now, in the cell below, write the code to spatially select the Berkeley BART stations, then make the map.

# YOUR CODE HERE:

Solution hidden here!


6.3 Proximity Analysis

Now that we’ve seen the basic idea of spatial measurement and relationship queries, let’s take a look at a common analysis that combines those concepts: promximity analysis.

Proximity analysis seeks to identify all features in a focal feature set that are within some maximum distance of features in a reference feature set.

A common workflow for this analysis is:

  1. Buffer (i.e. add a margin around) the reference dataset, out to the maximum distance.
  2. Run a spatial relationship query to find all focal features that intersect (or are within) the buffer.

Let’s read in our bike boulevard data again.

Then we’ll find out which of our Berkeley schools are within a block’s distance (200 meters) of the bike boulevards.

bike_blvds = st_read('notebook_data/transportation/BerkeleyBikeBlvds.geojson')
## Reading layer `BerkeleyBikeBlvds' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/transportation/BerkeleyBikeBlvds.geojson' using driver `GeoJSON'
## Simple feature collection with 211 features and 10 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 561541.2 ymin: 4189007 xmax: 566451.6 ymax: 4193483
## Projected CRS: WGS 84 / UTM zone 10N
plot(bike_blvds$geometry)

Of course, we need to reproject the boulevards to our projected CRS.

bike_blvds_utm10 = st_transform(bike_blvds, st_crs(berkeley_utm10))

Now we can create our 200 meter bike boulevard buffers.

bike_blvds_buf = st_buffer(bike_blvds_utm10, dist=200)

Now let’s overlay everything.

tm_shape(berkeley_utm10) + 
  tm_polygons(col = 'lightgrey') + 
tm_shape(bike_blvds_buf) + 
  tm_polygons(col = 'pink', alpha = 0.5) +
tm_shape(bike_blvds_utm10) + 
  tm_lines() + 
tm_shape(berkeley_schools_spatial) + 
  tm_dots(col = 'purple', size=0.2)

Great! Looks like we’re all ready to run our intersection to complete the proximity analysis. At this point (pun intended) it is a spatial selection (or intersection) operation, just like what we have done previously in this lesson.

schools_near_blvds = berkeley_schools_spatial[bike_blvds_buf,]
# or
#schools_near_blvds = st_intersection(schools_sf_utm10, bike_blvds_buf)

Now let’s overlay again, to see if the schools we subsetted make sense.

tm_shape(berkeley_utm10) + 
  tm_polygons(col = 'lightgrey') + 
tm_shape(bike_blvds_buf) + 
  tm_polygons(col = 'pink', alpha = 0.5) +
tm_shape(bike_blvds_utm10) + 
  tm_lines() + 
tm_shape(berkeley_schools_spatial) + 
  tm_dots(col = 'purple', size=0.2) +
tm_shape(schools_near_blvds) + 
  tm_dots(col = 'yellow', size=0.2)

Also note that if we want to find the pairwise distance matrix of the shortest distances between our schools and the bike boulevards, we can use the st_distance function.

st_distance(berkeley_schools_spatial, bike_blvds_utm10)

Similarly, we could use the function st_nearest_feature to get a vector of the row index for the nearest bike boulevard to each school.

st_nearest_feature(berkeley_schools_spatial, bike_blvds_utm10)
##  [1]  71 171  39 136  42  30 163 172 127 171 189 156 137  33 187 197  50 207 169
## [20] 102 125 137   3 109 207  76 135 173  56 102 146  76

Exercise: Proximity Analysis

Now it’s your turn to try out a proximity analysis!

Run the next cell to load our BART-system data, reproject it to EPSG: 26910, and subset it to Berkeley.

Then in the following cell, write your own code to find all schools within walking distance (1 km) of a BART station.

As a reminder, let’s break this into steps: 1. buffer your Berkeley BART stations to 1 km (HINT: remember your units!) 2. use the schools’ within attribute to check whether or not they’re within the buffers 3. subset the Berkeley schools using the object returned by your spatial relationship query

  1. as always, plot your results for a good visual check!

To see the solution, look at the hidden text below.

# load the BART stations from CSV
bart_stations = read.csv('notebook_data/transportation/bart.csv')
# coerce to an sf data.frame and set CRS to 4326
bart_stations_sf = st_as_sf(bart_stations, 
                             coords = c('lon', 'lat'),
                             crs = 4326)

# transform to utm zone 10 n (epsg:26910)
bart_stations_sf_utm10 = st_transform(bart_stations_sf, crs=26910)

# subset to berkeley
berkeley_bart = st_intersection(bart_stations_sf_utm10, berkeley_utm10)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
# YOUR CODE HERE:
# buffer the BART stations to 1 km

# spatially select the schools within the buffers

# Map with tmap
# plot the Berkeley boundary (for reference) in lightgrey

# add the BART stations (for reference) to the plot in green

# add the BART buffers (for check) in lightgreen

# add all Berkeley schools (for reference) in black

# add the schools near BART (for check) in yellow

Solution hidden here!


6.4 Recap

Leveraging what we’ve learned in our earlier lessons, we got to work with map overlays and start answering questions related to proximity. Key concepts include:

  • Measuring area and length
    • st_area,
    • st_length
    • st_distance
  • Relationship Queries
    • st_intersects,
    • st_intersection
    • st_within, etc.
  • Buffer analysis
    • st_buffer

 D-Lab @ University of California - Berkeley
 Team Geo